Bayesian Modeling Assignment

Penguin Body Mass Prediction & Fake News Detection

Author

DS 400: Bayesian Statistics

Published

November 24, 2025

🐧 Introduction: Two Bayesian Modeling Challenges

In this assignment, you’ll apply Bayesian statistical methods to two different problems:

  1. Part A: Penguin Body Mass Prediction - Build and compare multiple Normal regression models
  2. Part B: Fake News Detection - Build a logistic regression classifier

Both parts will help you practice:
  • Setting appropriate priors
  • Simulating Bayesian models with stan_glm()
  • Evaluating model quality
  • Interpreting posterior distributions
  • Making predictions

Assignment Requirements
  • ✅ Complete ALL challenge sections in both parts
  • ✅ Provide code AND written interpretations
  • ✅ Reference the class examples when setting priors
  • ✅ Answer all reflection questions
  • ✅ Render to PDF and submit to Canvas

Due Date: [11/6/25]


📦 Load Libraries

library(bayesrules)
library(dplyr)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)
library(ggpubr)
options(scipen = 99)

🐧 PART A: Penguin Body Mass Prediction

📊 Introduction to the Penguin Dataset

The penguins_bayes dataset contains measurements of 344 penguins from three species in Antarctica (342 remain after dropping incomplete cases). Our goal is to predict penguin body mass using various physical measurements.

Load and Explore the Data

# Load the penguin data
data(penguins_bayes)

# Clean the data - keep only complete cases for our variables of interest
penguins_complete <- penguins_bayes %>% 
  select(flipper_length_mm, body_mass_g, species, 
         bill_length_mm, bill_depth_mm) %>% 
  na.omit()

# Check the data
glimpse(penguins_complete)
Rows: 342
Columns: 5
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
$ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
Variable Definitions
  • body_mass_g: Body mass in grams (our outcome variable)
  • flipper_length_mm: Flipper length in millimeters
  • bill_length_mm: Bill (beak) length in millimeters
  • bill_depth_mm: Bill depth in millimeters
  • species: Adelie, Chinstrap, or Gentoo

🔍 CHALLENGE 1: Exploratory Data Analysis

Your Task

Before building models, explore the relationships between body mass and potential predictors.

Create the following visualizations:

  1. Scatter plot: body_mass_g vs flipper_length_mm
    • Add a smooth trend line using geom_smooth()
    • Add correlation statistics using stat_cor() from ggpubr
  2. Box plot: body_mass_g by species
    • Use geom_boxplot() or geom_violin()
    • Color by species
  3. Scatter plot: body_mass_g vs flipper_length_mm colored by species
    • This shows if the relationship differs by species

Hints & Resources:
  • Review the bike sharing example from class (temp_feel vs rides)
  • Use ggplot() with appropriate geoms
  • stat_cor() from ggpubr adds correlation coefficients automatically

# Plot 1: body_mass_g vs flipper_length_mm
ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  stat_cor(method = "pearson") +
  labs(title = "Penguin Body Mass vs Flipper Length",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  theme_minimal()

# Plot 2: body_mass_g by species
ggplot(penguins_complete, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  labs(title = "Body Mass by Species",
       x = "Species",
       y = "Body Mass (g)") +
  theme_minimal()

# Plot 3: body_mass_g vs flipper_length_mm colored by species
ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  stat_cor(aes(color = species)) +
  labs(title = "Body Mass vs Flipper Length by Species",
       x = "Flipper Length (mm)",
       y = "Body Mass (g)") +
  theme_minimal()

Based on your visualizations, answer:

  1. What is the correlation between flipper length and body mass? There is a strong positive correlation: as flipper length increases, body mass increases. A few Adelie outliers have higher body mass without correspondingly long flippers.
  2. Which species tends to be heaviest? Lightest? Gentoo are the heaviest species; Adelie are the lightest.
  3. Does the relationship between flipper length and body mass appear consistent across species? Yes, for the most part, though the Adelie and Chinstrap groups show more outliers.
  4. Which predictors do you think will be most useful? Flipper length looks like the most useful predictor of body mass, though there is a lot of overlap between Chinstrap and Adelie.

🎯 CHALLENGE 2: Build Model 1 (Simple Regression)

Your Task

Build a Bayesian normal regression model predicting body mass from flipper length only.

Model Formula: body_mass_g ~ flipper_length_mm

Setting Your Priors:

Think about reasonable values before seeing the data:

  1. Intercept Prior (prior_intercept):
    • When flipper length = 0 mm (not realistic, but mathematically necessary), what would body mass be? When the flipper length is 0, the body mass would be about 575 grams if body mass drops by 75 grams for every 5 mm of flipper length.
# Build penguin_model_1 predicting body_mass_g from flipper_length_mm
penguin_model_1 <- stan_glm(
  body_mass_g ~ flipper_length_mm,
  data = penguins_complete,
  family = gaussian,
  prior_intercept = normal(0, 2500),
  prior = normal(50, 25),
  prior_aux = exponential(1/500),
  chains = 4,
  iter = 5000*2,
  seed = 84735
)

# Print model results
print(penguin_model_1)
stan_glm
 family:       gaussian [identity]
 formula:      body_mass_g ~ flipper_length_mm
 observations: 342
 predictors:   2
------
                  Median  MAD_SD 
(Intercept)       -5783.6   304.4
flipper_length_mm    49.7     1.5

Auxiliary parameter(s):
      Median MAD_SD
sigma 394.6   14.8 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
  2. Slope Prior (prior):
    • For every 1 mm increase in flipper length, how much does body mass increase?
    • A penguin with longer flippers is probably heavier: maybe 20-50 g per mm.
  3. Sigma Prior (prior_aux):
    • How much do individual penguins vary from the line?
    • Body mass varies quite a bit: maybe 500-1000 g?
    • Try: exponential(1/500) (remember: rate = 1/mean)
Hints & Resources:
  • Review the bike model from class: bike_model <- stan_glm(rides ~ temp_feel, ...)
  • Use family = gaussian for normal regression
  • Set chains = 4, iter = 5000*2, seed = 84735 for reproducibility
  • Name your model penguin_model_1

Interpreting the Output

After your model runs, print it to see:
  • Intercept: where the line crosses the y-axis (at flipper length = 0)
  • flipper_length_mm: the slope, i.e., the change in body mass per 1 mm increase in flipper length
  • sigma: the residual standard deviation, i.e., the typical prediction error
  • MAD_SD: the uncertainty in each parameter estimate

Look for:
  • Is the slope positive (heavier penguins have longer flippers)?
  • What's the typical prediction error (sigma)?
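As a quick sanity check, you can plug the posterior medians from the printed output above into the fitted line by hand. For example, for a penguin with 200 mm flippers:

```r
# Posterior medians copied from penguin_model_1's printed output
intercept <- -5783.6
slope     <- 49.7

# Predicted body mass (g) for a 200 mm flipper
pred_200 <- intercept + slope * 200
pred_200
```

This gives about 4156 g, a plausible mass for a mid-sized penguin, so the large negative intercept is just harmless extrapolation to flipper length 0.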


🎯 CHALLENGE 3: Build Model 2 (Species Only)

Your Task

Build a model predicting body mass from species only (no physical measurements).

Model Formula: body_mass_g ~ species

Setting Your Priors:

  1. Intercept Prior:
    • This will be the body mass for the reference species (Adelie)
    • Adelie penguins are smaller, around 3500-4000g
    • Try: normal(3750, 500)
  2. Slope Priors:
    • These represent differences from Adelie for Chinstrap and Gentoo
    • Gentoo are larger (maybe +1000g?), Chinstrap similar to Adelie
    • Try: normal(0, 1000) (centered at no difference, but allows large variation)
  3. Sigma Prior:
    • Same reasoning as Model 1
    • Try: exponential(1/500)
# Build penguin_model_2 predicting body_mass_g from species
penguin_model_2 <- stan_glm(
  body_mass_g ~ species,
  data = penguins_complete,
  family = gaussian,
  prior_intercept = normal(3750, 500),
  prior = normal(0, 1000),
  prior_aux = exponential(1/500),
  chains = 4,
  iter = 5000*2,
  seed = 84735
)

# Print model results
print(penguin_model_2)
stan_glm
 family:       gaussian [identity]
 formula:      body_mass_g ~ species
 observations: 342
 predictors:   3
------
                 Median MAD_SD
(Intercept)      3701.0   37.3
speciesChinstrap   30.3   68.1
speciesGentoo    1371.1   56.8

Auxiliary parameter(s):
      Median MAD_SD
sigma 462.8   17.5 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

When you include species in the model:
  • One species becomes the "reference" level (usually the alphabetically first = Adelie)
  • The intercept = the mean body mass for Adelie
  • The other coefficients = differences from Adelie
  • Example: if speciesChinstrap = 32, Chinstraps are 32 g heavier than Adelies (on average)
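Using the posterior medians printed above for penguin_model_2, the implied mean body mass for each species can be recovered by hand:

```r
# Posterior medians copied from penguin_model_2's printed output
adelie    <- 3701.0               # intercept = reference (Adelie) mean
chinstrap <- adelie + 30.3        # Adelie mean + Chinstrap offset
gentoo    <- adelie + 1371.1      # Adelie mean + Gentoo offset

c(Adelie = adelie, Chinstrap = chinstrap, Gentoo = gentoo)
```

These species means (about 3701 g, 3731 g, and 5072 g) match the pattern from the Challenge 1 box plot.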


🎯 CHALLENGE 4: Build Model 3 (Flipper + Species)

Your Task

Build a model using BOTH flipper length and species as predictors.

Model Formula: body_mass_g ~ flipper_length_mm + species

Setting Your Priors:
  • Similar to Models 1 and 2, but now we're combining information
  • Intercept: normal(0, 2500) (body mass when flipper length = 0 for the reference species)
  • Slopes: normal(0, 1000, autoscale = TRUE) (let rstanarm auto-scale based on data units)
  • Sigma: exponential(1/500)

Hint: Use autoscale = TRUE in the normal() prior - this automatically adjusts the scale based on your data units, which is helpful when predictors are on different scales.

# Build penguin_model_3 with both flipper_length_mm and species
penguin_model_3 <- stan_glm(
  body_mass_g ~ flipper_length_mm + species,
  data = penguins_complete,
  family = gaussian,
  prior_intercept = normal(0, 2500),
  prior = normal(0, 1000, autoscale = TRUE),
  prior_aux = exponential(1/500),
  chains = 4,
  iter = 5000*2,
  seed = 84735
)

# Print model results
print(penguin_model_3)
stan_glm
 family:       gaussian [identity]
 formula:      body_mass_g ~ flipper_length_mm + species
 observations: 342
 predictors:   4
------
                  Median  MAD_SD 
(Intercept)       -4021.4   589.8
flipper_length_mm    40.7     3.1
speciesChinstrap   -206.7    57.7
speciesGentoo       267.9    94.8

Auxiliary parameter(s):
      Median MAD_SD
sigma 375.9   14.6 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Think About It

When you include both flipper length AND species:
  • Does species still matter after accounting for flipper size?
  • Or is species just a proxy for "big flippers"?
  • Compare the species coefficients in Model 2 vs Model 3!
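One way to make that comparison concrete is to line up the species offsets from the two printed summaries side by side (posterior medians copied from the outputs above):

```r
# Species offsets (posterior medians) from Models 2 and 3
species_offsets <- data.frame(
  species = c("Chinstrap", "Gentoo"),
  model_2 = c(30.3, 1371.1),      # body_mass_g ~ species
  model_3 = c(-206.7, 267.9)      # body_mass_g ~ flipper_length_mm + species
)
species_offsets
```

The Gentoo advantage shrinks from roughly 1371 g to 268 g once flipper length is in the model, suggesting much of the species gap is indeed a "big flippers" effect.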


🎯 CHALLENGE 5: Build Model 4

Your Task

Build a model using multiple physical measurements (no species).

Model Formula: body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm

Setting Your Priors:
  • Use the autoscaling approach, since we have multiple predictors on different scales
  • Intercept: normal(0, 2500)
  • Slopes: normal(0, 1000, autoscale = TRUE)
  • Sigma: exponential(1/500)

# Build penguin_model_4 with flipper, bill length, and bill depth
penguin_model_4 <- stan_glm(
  body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm,
  data = penguins_complete,
  family = gaussian,
  prior_intercept = normal(0, 2500),
  prior = normal(0, 1000, autoscale = TRUE),
  prior_aux = exponential(1/500),
  chains = 4,
  iter = 5000*2,
  seed = 84735
)

# Print model results
print(penguin_model_4)
stan_glm
 family:       gaussian [identity]
 formula:      body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm
 observations: 342
 predictors:   4
------
                  Median  MAD_SD 
(Intercept)       -6424.1   565.9
flipper_length_mm    50.3     2.5
bill_length_mm        4.2     5.3
bill_depth_mm        20.0    13.7

Auxiliary parameter(s):
      Median MAD_SD
sigma 394.0   15.3 

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

📊 CHALLENGE 6: Posterior Predictive Checks

Your Task

Use pp_check() to visualize how well each model’s predictions match the actual data.

What is a pp_check?
  • Generates simulated datasets from your model's posterior
  • Overlays them on the actual data
  • Good fit = simulated data (light blue) looks similar to the actual data (dark blue)

Create pp_check plots for all 4 models

Hints & Resources:
  • Simply use: pp_check(your_model_name)
  • The plot shows: dark line = actual data distribution; light lines = simulated predictions from the posterior
  • You want them to overlap!

# Model 1 pp_check
pp_check(penguin_model_1) +
  labs(title = "Model 1: Flipper Length Only")

# Model 2 pp_check
pp_check(penguin_model_2) +
  labs(title = "Model 2: Species Only")

# Model 3 pp_check
pp_check(penguin_model_3) +
  labs(title = "Model 3: Flipper Length + Species")

# Model 4 pp_check
pp_check(penguin_model_4) +
  labs(title = "Model 4: All Physical Measurements")

For each model, consider:
  1. Do the simulated predictions (light blue) capture the shape of the actual data (dark blue)?
  2. Are the predictions too narrow? Too wide? Just right?
  3. Which model's predictions look most realistic?
  4. Do any models miss important features of the data?


🎯 CHALLENGE 7: Cross-Validation Comparison

Your Task

Use 10-fold cross-validation to compare the predictive accuracy of all 4 models.

What is Cross-Validation?
  • Splits the data into 10 parts (folds)
  • Trains the model on 9 parts, tests it on the remaining part
  • Repeats 10 times so every part gets tested once
  • Measures how well the model predicts NEW data it hasn't seen

Key Metrics:
  • mae (Mean Absolute Error): average prediction error in grams; lower is better
  • mae_scaled: MAE relative to the variability of the outcome; lower is better
  • within_50 / within_95: the proportion of observations falling inside their 50% / 95% posterior prediction intervals; values near 0.50 / 0.95 indicate good calibration

Steps:
  1. Run prediction_summary_cv(model, data, k = 10) for each model
  2. Compare the mae values: which model has the lowest prediction error?

Hints & Resources:
  • Use the same dataset for all models: data = penguins_complete
  • Set k = 10 for 10-fold CV
  • This may take 2-3 minutes per model, so be patient!
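To see exactly what mae measures, here is the calculation on a toy set of observed and predicted masses (made-up numbers, purely illustrative):

```r
observed  <- c(3750, 3800, 3250, 3450)   # hypothetical body masses (g)
predicted <- c(3600, 3900, 3400, 3500)   # hypothetical model predictions (g)

# MAE = average absolute prediction error, in grams
mae_toy <- mean(abs(observed - predicted))
mae_toy
```

Here mae_toy is 112.5 g: on average, each prediction misses the observed mass by about 112 g.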

# Cross-validation for Model 1
cv_model_1 <- prediction_summary_cv(model = penguin_model_1, data = penguins_complete, k = 10)
print("Model 1 CV Results:")
[1] "Model 1 CV Results:"
print(cv_model_1)
$folds
   fold      mae mae_scaled within_50 within_95
1     1 355.9513  0.9126390 0.3428571 0.9142857
2     2 230.9781  0.5740008 0.6176471 0.9705882
3     3 317.2391  0.8008372 0.4411765 0.9705882
4     4 263.4388  0.6670619 0.5000000 0.9117647
5     5 205.6063  0.5059181 0.7058824 0.9705882
6     6 208.4931  0.5205635 0.6470588 0.9705882
7     7 228.9489  0.5688737 0.6176471 0.9705882
8     8 300.5939  0.7566558 0.3823529 0.9705882
9     9 240.6779  0.6130910 0.5000000 0.8823529
10   10 296.4058  0.7702112 0.4571429 0.8857143

$cv
       mae mae_scaled within_50 within_95
1 264.8333  0.6689852 0.5211765 0.9417647
# Cross-validation for Model 2
cv_model_2 <- prediction_summary_cv(model = penguin_model_2, data = penguins_complete, k = 10)
print("Model 2 CV Results:")
[1] "Model 2 CV Results:"
print(cv_model_2)
$folds
   fold      mae mae_scaled within_50 within_95
1     1 312.3230  0.6754649 0.5142857 0.9142857
2     2 333.9153  0.7177005 0.4705882 0.9117647
3     3 274.6706  0.5788492 0.5882353 1.0000000
4     4 412.6890  0.9107085 0.2941176 0.8529412
5     5 360.9655  0.7713857 0.4411765 1.0000000
6     6 319.8579  0.6757356 0.5294118 0.9705882
7     7 239.3413  0.5056069 0.6764706 1.0000000
8     8 345.9650  0.7443320 0.4411765 0.9705882
9     9 403.1256  0.8653837 0.3529412 1.0000000
10   10 447.2610  0.9635467 0.3428571 0.9428571

$cv
       mae mae_scaled within_50 within_95
1 345.0114  0.7408714 0.4651261 0.9563025
# Cross-validation for Model 3
cv_model_3 <- prediction_summary_cv(model = penguin_model_3, data = penguins_complete, k = 10)
print("Model 3 CV Results:")
[1] "Model 3 CV Results:"
print(cv_model_3)
$folds
   fold      mae mae_scaled within_50 within_95
1     1 227.4349  0.6023102 0.6000000 0.9142857
2     2 311.4796  0.8226924 0.4411765 0.9411765
3     3 189.8827  0.4921277 0.5882353 0.9705882
4     4 326.8773  0.8663749 0.4411765 0.9705882
5     5 164.3481  0.4289294 0.5882353 0.9705882
6     6 361.5074  0.9589549 0.3529412 0.9705882
7     7 321.7909  0.8509357 0.4705882 0.9705882
8     8 290.8079  0.7794825 0.4705882 0.9117647
9     9 260.7713  0.6958947 0.4705882 0.9117647
10   10 225.4976  0.5908510 0.5142857 0.9714286

$cv
       mae mae_scaled within_50 within_95
1 268.0398  0.7088553 0.4937815 0.9503361
# Cross-validation for Model 4
cv_model_4 <- prediction_summary_cv(model = penguin_model_4, data = penguins_complete, k = 10)
print("Model 4 CV Results:")
[1] "Model 4 CV Results:"
print(cv_model_4)
$folds
   fold      mae mae_scaled within_50 within_95
1     1 253.1219  0.6396723 0.5428571 0.9714286
2     2 244.4331  0.6141926 0.5294118 0.9411765
3     3 240.4087  0.5961595 0.5588235 1.0000000
4     4 381.4213  0.9999391 0.3823529 0.7647059
5     5 307.3974  0.7747796 0.4411765 0.9705882
6     6 191.3420  0.4789601 0.5588235 0.9705882
7     7 200.3974  0.4906427 0.6764706 1.0000000
8     8 248.7322  0.6216717 0.5588235 0.9705882
9     9 289.8677  0.7390738 0.4411765 0.8823529
10   10 344.3580  0.8756287 0.3428571 0.9142857

$cv
      mae mae_scaled within_50 within_95
1 270.148   0.683072 0.5032773 0.9385714
Comparing Models

Fill in the comparison table below:

Model Predictors MAE (grams) Interpretation
1 flipper only 264.8 Lowest prediction error
2 species only 345.0 Worst performer
3 flipper + species 268.0 Essentially tied with Model 1
4 all measurements 270.1 Close third
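Rather than transcribing the numbers by hand, the `$cv` summary rows can be bound into one comparison table. A sketch: here the `$cv` data frames are mocked with the MAE values printed above; with the real objects you would pass `cv_model_1$cv`, `cv_model_2$cv`, etc.

```r
library(dplyr)

# Mocked $cv summaries (MAE values from the CV output above)
cv_list <- list(
  `Model 1` = data.frame(mae = 264.8),
  `Model 2` = data.frame(mae = 345.0),
  `Model 3` = data.frame(mae = 268.0),
  `Model 4` = data.frame(mae = 270.1)
)

# Stack into one table, labeled by model, best (lowest MAE) first
cv_comparison <- bind_rows(cv_list, .id = "model") %>%
  arrange(mae)

cv_comparison
```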

Which model would you choose and why? Consider:
  • Predictive accuracy (lowest MAE)
  • Model simplicity (fewer predictors)
  • Interpretability
  • The trade-off between complexity and performance

I would choose Model 3. Its cross-validated MAE (268.0 grams) is essentially tied with Model 1’s (264.8 grams), it combines flipper length and species in a way that makes biological sense, and it is simpler than Model 4 while predicting at least as well.


🎓 Part A Reflection Questions

Answer These Questions
  1. Model Selection: Which model performed best in cross-validation? Was it the most complex model? 1a. Model 1 (flipper only) had the lowest MAE (264.8 g), with Model 3 (flipper + species) essentially tied at 268.0 g. The best model was not the most complex: Model 4 used the most predictors yet performed slightly worse (270.1 g).

  2. Species vs Measurements: Does knowing the species add information beyond physical measurements? How can you tell? 2a. Not much once flipper length is known: Model 3 (flipper + species, 268.0 g) performs about the same as Model 1 (flipper only, 264.8 g). Species alone (Model 2, 345.0 g) is clearly worse than flipper length alone, so the physical measurement carries most of the predictive information.

  3. Practical Use: If you were a penguin researcher with limited measurement tools, which model would you use and why? 3a. I’d use Model 1: it requires only a single easy measurement (flipper length), and its cross-validated MAE (264.8 g) is as low as any of the models’.

  4. Prior Sensitivity: How might different priors have affected your results? Were your priors informative or vague? 4a. My priors were fairly vague, especially for the intercept. More informative priors based on actual penguin knowledge would mainly shift and tighten the posteriors slightly; with this many observations, the data would dominate either way.

  5. Assumptions: What assumptions does normal regression make? Are they reasonable for penguin body mass? 5a. It assumes independent observations, a linear mean relationship, and normally distributed errors with constant variance. For body mass these seem broadly reasonable, though there may be some non-linearity at extreme sizes and the error variance may differ across species.


📰 PART B: Fake News Detection

📊 Introduction to Fake News Classification

The fake_news dataset contains information about 150 news articles. Our goal is to build a logistic regression classifier to distinguish real news from fake news based on article characteristics.

Load and Explore the Data

# Load the fake news data
data(fake_news)

# Check the structure
glimpse(fake_news)
Rows: 150
Columns: 30
$ title                   <chr> "Clinton's Exploited Haiti Earthquake ‘to Stea…
$ text                    <chr> "0 SHARES Facebook Twitter\n\nBernard Sansaric…
$ url                     <chr> "http://freedomdaily.com/former-haitian-senate…
$ authors                 <chr> NA, NA, "Sierra Marlee", "Jack Shafer,Nolan D"…
$ type                    <fct> fake, real, fake, real, fake, real, fake, fake…
$ title_words             <int> 17, 18, 16, 11, 9, 12, 11, 18, 10, 13, 10, 11,…
$ text_words              <int> 219, 509, 494, 268, 479, 220, 184, 500, 677, 4…
$ title_char              <int> 110, 95, 96, 60, 54, 66, 86, 104, 66, 81, 59, …
$ text_char               <int> 1444, 3016, 2881, 1674, 2813, 1351, 1128, 3112…
$ title_caps              <int> 0, 0, 1, 0, 0, 1, 0, 2, 1, 1, 0, 1, 0, 0, 0, 0…
$ text_caps               <int> 1, 1, 3, 3, 0, 0, 0, 12, 12, 1, 2, 5, 1, 1, 6,…
$ title_caps_percent      <dbl> 0.000000, 0.000000, 6.250000, 0.000000, 0.0000…
$ text_caps_percent       <dbl> 0.4566210, 0.1964637, 0.6072874, 1.1194030, 0.…
$ title_excl              <int> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ text_excl               <int> 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0…
$ title_excl_percent      <dbl> 0.0000000, 0.0000000, 2.0833333, 0.0000000, 0.…
$ text_excl_percent       <dbl> 0.00000000, 0.00000000, 0.06942034, 0.00000000…
$ title_has_excl          <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE…
$ anger                   <dbl> 4.24, 2.28, 1.18, 4.66, 0.82, 1.29, 2.56, 3.47…
$ anticipation            <dbl> 2.12, 1.71, 2.16, 1.79, 1.23, 0.43, 2.05, 1.74…
$ disgust                 <dbl> 2.54, 1.90, 0.98, 1.79, 0.41, 1.72, 2.05, 1.35…
$ fear                    <dbl> 3.81, 1.90, 1.57, 4.30, 0.82, 0.43, 5.13, 4.25…
$ joy                     <dbl> 1.27, 1.71, 1.96, 0.36, 1.23, 0.86, 1.54, 1.35…
$ sadness                 <dbl> 4.66, 1.33, 0.78, 1.79, 0.82, 0.86, 2.05, 1.93…
$ surprise                <dbl> 2.12, 1.14, 1.18, 1.79, 0.82, 0.86, 1.03, 1.35…
$ trust                   <dbl> 2.97, 4.17, 3.73, 2.51, 2.46, 2.16, 5.13, 3.86…
$ negative                <dbl> 8.47, 4.74, 3.33, 6.09, 2.66, 3.02, 4.10, 4.63…
$ positive                <dbl> 3.81, 4.93, 5.49, 2.15, 4.30, 2.16, 4.10, 4.25…
$ text_syllables          <int> 395, 845, 806, 461, 761, 376, 326, 891, 1133, …
$ text_syllables_per_word <dbl> 1.803653, 1.660118, 1.631579, 1.720149, 1.5887…
Variable Definitions
  • type: “Real” or “Fake” (our outcome variable)
  • title_has_excl: Does the title contain an exclamation point? (TRUE/FALSE)
  • title_words: Number of words in the article title
  • negative: Negative sentiment score of the article text (note: the values shown above run well above 1, so this is a percent-of-words score, not a 0-1 scale)

🔍 CHALLENGE 8: Exploratory Data Analysis for Classification

Your Task

Explore how article characteristics differ between real and fake news.

Create the following visualizations:

  1. Bar chart: Count of Real vs Fake articles
    • Use geom_bar() with fill = type
  2. Box plots: Compare title_words and negative between Real and Fake
    • Create two separate plots using geom_boxplot()
  3. Proportions: What percentage of articles with exclamation points are fake?
    • Use count() and mutate() to calculate proportions

Hints & Resources: - Review the weather Perth rain examples from class

# Plot 1: Bar chart of Real vs Fake
ggplot(fake_news, aes(x = type, fill = type)) +
  geom_bar() +
  labs(title = "Count of Real vs Fake News Articles",
       x = "Article Type",
       y = "Count") +
  theme_minimal()

# Plot 2: Box plot of title_words by type
ggplot(fake_news, aes(x = type, y = title_words, fill = type)) +
  geom_boxplot() +
  labs(title = "Title Word Count by Article Type",
       x = "Article Type",
       y = "Number of Words in Title") +
  theme_minimal()

# Plot 3: Box plot of negative sentiment by type
ggplot(fake_news, aes(x = type, y = negative, fill = type)) +
  geom_boxplot() +
  labs(title = "Negative Sentiment by Article Type",
       x = "Article Type",
       y = "Negative Sentiment Score") +
  theme_minimal()

# Calculate proportion of fake news by exclamation point presence
excl_table <- fake_news %>%
  count(title_has_excl, type) %>%
  group_by(title_has_excl) %>%
  mutate(prop = n / sum(n))

print("Proportions by Exclamation Point:")
[1] "Proportions by Exclamation Point:"
print(excl_table)
# A tibble: 4 × 4
# Groups:   title_has_excl [2]
  title_has_excl type      n  prop
  <lgl>          <fct> <int> <dbl>
1 FALSE          fake     44 0.333
2 FALSE          real     88 0.667
3 TRUE           fake     16 0.889
4 TRUE           real      2 0.111
  1. What proportion of articles in the dataset are fake? 1a. About 40% are fake (60 fake vs 90 real)

  2. Do fake news articles use more exclamation points? 2a. Yes: 89% of articles with an exclamation point in the title are fake (16 of 18), versus 33% of articles without

  3. Do fake and real news differ in title length? 3a. Fake news tends to have slightly longer titles

  4. Is negative sentiment associated with fake news? 4a. Yes, fake news has higher negative sentiment scores


🎯 CHALLENGE 9: Build Fake News Classifier

Your Task

Build a Bayesian logistic regression model to predict whether an article is fake based on the three predictors.

Model Formula: type ~ title_has_excl + title_words + negative

Understanding Logistic Regression Priors:

Remember: In logistic regression, we model log-odds, not probabilities directly!

Prior for Intercept (baseline log-odds when all predictors = 0):
  • Think: what is the chance an article is real when it has no exclamation point and the other predictors are at 0?
  • From your EDA, about 60% of articles are real, so the baseline log-odds are near 0 (log(0.6/0.4) ≈ 0.4)
  • But we’re uncertain, so use a wide prior: normal(0, 1.5)
  • plogis(0) = 50% probability; plogis(-1.5) ≈ 18% and plogis(1.5) ≈ 82% (a wide range!)
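The plogis() claims above are easy to verify in the console:

```r
# plogis() converts log-odds to probability (the logistic CDF)
plogis(0)     # 0.5: log-odds of 0 means a 50% chance
plogis(-1.5)  # about 0.18
plogis(1.5)   # about 0.82, so normal(0, 1.5) leaves the baseline chance wide open
```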

Priors for Slopes:
  • These represent how much the log-odds change per unit increase in the predictor
  • We’re uncertain about effect sizes, so use weakly informative priors
  • Try: normal(0, 1, autoscale = TRUE)
  • This says: effects could be positive or negative, but probably not extreme

Important: Use family = binomial for logistic regression!

Hints & Resources: - Review the rain_model_1 from class: stan_glm(raintomorrow ~ humidity9am, family = binomial, ...) - Remember: outcome must be a factor or binary (0/1) - Set chains = 4, iter = 5000*2, seed = 84735 - Name your model fake_news_model

Your Code:

# Build logistic regression model for fake news detection
fake_news_model <- stan_glm(
  type ~ title_has_excl + title_words + negative,
  data = fake_news,
  family = binomial,
  prior_intercept = normal(0, 1.5),
  prior = normal(0, 1, autoscale = TRUE),
  chains = 4,
  iter = 5000*2,
  seed = 84735
)

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000172 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.72 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.1 seconds (Warm-up)
Chain 1:                0.118 seconds (Sampling)
Chain 1:                0.218 seconds (Total)
Chain 1: 

(Sampling output for chains 2-4 omitted; each chain ran 10000 iterations and finished in about 0.23 seconds, like chain 1.)
# Print model results
print(fake_news_model)
stan_glm
 family:       binomial [logit]
 formula:      type ~ title_has_excl + title_words + negative
 observations: 150
 predictors:   4
------
                   Median MAD_SD
(Intercept)         2.9    0.8  
title_has_exclTRUE -2.5    0.8  
title_words        -0.1    0.1  
negative           -0.3    0.2  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
Interpreting Logistic Regression Output

After printing your model:
  • Intercept: log-odds of REAL news at baseline (all predictors = 0)
  • Coefficients: change in log-odds per unit increase in the predictor
  • Important: the model predicts the probability of REAL news (coded as 1)
  • Positive coefficient = increases odds of REAL news
  • Negative coefficient = increases odds of FAKE news (decreases odds of real)


🎯 CHALLENGE 10: Interpret Odds Ratios

Your Task

Convert the log-odds coefficients to odds ratios to make them interpretable.

What are Odds Ratios?
  • Odds Ratio (OR) = exp(coefficient)
  • OR > 1 means the predictor increases the odds of REAL news
  • OR < 1 means the predictor increases the odds of FAKE news (decreases odds of real)
  • OR = 1 means no effect

Calculate 80% Credible Intervals:

Use: exp(posterior_interval(fake_news_model, prob = 0.80))

Interpret the results (remember: the model predicts REAL news):
  • For title_has_exclTRUE: if OR < 1, exclamation points are associated with FAKE news
  • For title_words: how does title length affect the odds of real vs fake?
  • For negative: how does negative sentiment affect the odds of real vs fake?

Hints & Resources: - Review the rain model interpretation from class - To convert OR to percent change: (OR - 1) * 100 - Example: OR = 0.75 means -25% (reduces odds of REAL news by 25% = increases odds of FAKE) - Example: OR = 1.25 means +25% (increases odds of REAL news by 25%)

# Calculate and display odds ratios with 80% credible intervals
odds_ratios <- exp(posterior_interval(fake_news_model, prob = 0.80))
print("Odds Ratios with 80% Credible Intervals:")
[1] "Odds Ratios with 80% Credible Intervals:"
print(odds_ratios)
                          10%        90%
(Intercept)        7.12361124 49.4424472
title_has_exclTRUE 0.02891118  0.2109089
title_words        0.83154943  0.9616440
negative           0.59926930  0.8812999
CRITICAL: Understanding the Outcome Variable

In R’s logistic regression, the outcome is coded alphabetically: - “Fake” (first alphabetically) = 0 - “Real” (second alphabetically) = 1

This means your model predicts the probability of REAL news, not fake news!

When interpreting odds ratios: - OR > 1 means predictor increases odds of REAL news - OR < 1 means predictor increases odds of FAKE news
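A quick self-contained check of this coding rule, using a toy factor rather than the real data:

```r
# A two-level factor is coded alphabetically: first level = 0, second = 1
type <- factor(c("real", "fake", "fake"))

levels(type)          # "fake" "real": "fake" is the first (reference) level
as.numeric(type) - 1  # 1 0 0: "real" maps to 1, the modeled outcome
```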

For each predictor, write 1-2 sentences interpreting the odds ratio. Remember: OR < 1 means the predictor is associated with FAKE news!

Example interpretation for title_has_excl (OR = 0.03 to 0.21):

“Articles with exclamation points have 79-97% lower odds of being REAL news (80% CI: 0.03, 0.21). This means exclamation points are strongly associated with FAKE news - articles with ! are much more likely to be fake.”

To calculate percent change: (OR - 1) × 100
  • (0.03 - 1) × 100 = -97%
  • (0.21 - 1) × 100 = -79%
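The same arithmetic as a tiny helper (the function name is my own, not part of bayesrules):

```r
# Percent change in the odds of REAL news implied by an odds ratio
or_to_pct_change <- function(or) (or - 1) * 100

or_to_pct_change(0.03)  # -97: 97% lower odds of being real
or_to_pct_change(0.21)  # -79: 79% lower odds of being real
or_to_pct_change(1.25)  # +25: 25% higher odds of being real
```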

Now write interpretations for:

  1. title_words (OR = 0.83 to 0.96): How does title length relate to fake vs real news? 1a. Longer titles are associated with lower odds of real news (OR 0.83-0.96), meaning fake news tends to have slightly longer titles.

  2. negative (OR = 0.60 to 0.88): How does negative sentiment relate to fake vs real news? 2a. Higher negative sentiment is associated with lower odds of real news (OR 0.60-0.88), so fake news tends to be more negative.

Summary question: Based on these odds ratios, what characteristics define fake news in this dataset? 3a. Fake news tends to have exclamation points, longer titles, and more negative sentiment.


🎯 CHALLENGE 11: Classification Performance (Cutoff = 0.5)

Your Task

Evaluate how well your model classifies articles as real or fake using a 0.5 probability cutoff.

What does this mean?
  • If the model predicts P(Real) ≥ 0.5, classify as Real
  • If the model predicts P(Real) < 0.5, classify as Fake

Use: classification_summary(model = fake_news_model, data = fake_news, cutoff = 0.5)

Key Metrics to Understand:

Metric Formula Interpretation
Accuracy (TP + TN) / Total Overall % correct
Sensitivity (True Positive Rate) TP / (TP + FN) % of real news correctly identified
Specificity (True Negative Rate) TN / (TN + FP) % of fake news correctly identified

Where (“real” is the positive class, coded 1):
  • TP = True Positives (correctly identified real news)
  • TN = True Negatives (correctly identified fake news)
  • FP = False Positives (fake news wrongly called real)
  • FN = False Negatives (real news wrongly called fake)

# Evaluate classification at cutoff = 0.5
class_summary_0.5 <- classification_summary(
  model = fake_news_model, 
  data = fake_news, 
  cutoff = 0.5
)
print("Classification Performance at Cutoff = 0.5:")
[1] "Classification Performance at Cutoff = 0.5:"
print(class_summary_0.5)
$confusion_matrix
    y  0  1
 fake 26 34
 real  7 83

$accuracy_rates
                          
sensitivity      0.9222222
specificity      0.4333333
overall_accuracy 0.7266667
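As a sanity check, the accuracy rates printed above follow directly from the confusion matrix counts, with "real" as the positive class:

```r
# Counts from the cutoff = 0.5 confusion matrix
TP <- 83  # real articles classified as real
TN <- 26  # fake articles classified as fake
FP <- 34  # fake articles classified as real
FN <- 7   # real articles classified as fake

sensitivity <- TP / (TP + FN)                   # 83 / 90, about 0.922
specificity <- TN / (TN + FP)                   # 26 / 60, about 0.433
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 109 / 150, about 0.727
```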
  1. What is the overall accuracy? 1a. About 73% accuracy - better than the 60% you would get by always guessing “real”, but only modest.

  2. What is the sensitivity? Are we good at identifying real news? 2a. 92% sensitivity, so the model is good at identifying real news.

  3. What is the specificity? Are we good at catching fake news? 3a. Only 43% specificity: the model misses more than half of the fake articles, so it is poor at catching fake news.

  4. Which type of error is more problematic:

    • False positives (calling fake news “real”)? More problematic, because it lets misinformation spread.
    • False negatives (calling real news “fake”)? Less problematic, but it still suppresses legitimate news.

🎯 CHALLENGE 12: Adjusting the Classification Threshold

Your Task

Explore how changing the cutoff threshold affects classification performance.

The Trade-off:
  • Lower cutoff (e.g., 0.3): Easier to classify as Real
    • ✅ Catches more real news (higher sensitivity)
    • ❌ More false alarms - labels more fake news as real (lower specificity)

  • Higher cutoff (e.g., 0.7): Harder to classify as Real
    • ❌ Misses more real news (lower sensitivity)
    • ✅ Fewer false alarms - better at catching fake (higher specificity)

Remember: Since model predicts P(Real), lowering the cutoff makes it EASIER to call something real (and thus HARDER to call it fake).

Try these cutoffs and compare:
  1. cutoff = 0.3 (liberal - more willing to call articles “real”)
  2. cutoff = 0.7 (conservative - more skeptical, flags more as “fake”)

# Evaluate classification at cutoff = 0.3
class_summary_0.3 <- classification_summary(
  model = fake_news_model, 
  data = fake_news, 
  cutoff = 0.3
)
print("Classification Performance at Cutoff = 0.3:")
[1] "Classification Performance at Cutoff = 0.3:"
print(class_summary_0.3)
$confusion_matrix
    y  0  1
 fake 17 43
 real  2 88

$accuracy_rates
                          
sensitivity      0.9777778
specificity      0.2833333
overall_accuracy 0.7000000
# Evaluate classification at cutoff = 0.7
class_summary_0.7 <- classification_summary(
  model = fake_news_model, 
  data = fake_news, 
  cutoff = 0.7
)
print("Classification Performance at Cutoff = 0.7:")
[1] "Classification Performance at Cutoff = 0.7:"
print(class_summary_0.7)
$confusion_matrix
    y  0  1
 fake 49 11
 real 42 48

$accuracy_rates
                          
sensitivity      0.5333333
specificity      0.8166667
overall_accuracy 0.6466667
Comparison Table

Fill in the comparison table:

Cutoff Accuracy Sensitivity (ID Real) Specificity (ID Fake) Best For…
0.3 70% 98% 28% Protecting real news from being flagged
0.5 73% 92% 43% Balanced (best overall accuracy)
0.7 65% 53% 82% Catching fake news

Discussion: Which cutoff would you choose if:
  • You run a social media platform (want to avoid censoring real news)? Use 0.3: sensitivity of about 98% means almost no real news gets flagged as fake.
  • You’re a fact-checking website (want to flag as much fake news as possible)? Use 0.7: specificity of about 82% catches the most fake articles, at the cost of flagging more real ones.
  • Remember: lowering the cutoff = easier to classify as “real” = harder to flag as “fake”.


🎯 CHALLENGE 13: Visualize Model Predictions

Your Task

Create a visualization showing how predicted probabilities vary with one or more predictors.

Fitted Draws Plot

Use add_fitted_draws() to show the relationship between a predictor and predicted probability:

fake_news %>%
  add_fitted_draws(fake_news_model, n = 100) %>%
  ggplot(aes(x = title_words, y = type)) +
    geom_line(aes(y = .value, group = .draw), alpha = 0.15) + 
    labs(y = "probability of REAL news")

Note: .value will be between 0 and 1, representing P(Real news)

# Visualize model predictions
fake_news %>%
  add_fitted_draws(fake_news_model, n = 100) %>%
  ggplot(aes(x = title_words, y = type)) +
  geom_line(aes(y = .value, group = .draw), alpha = 0.1, color = "blue") + 
  labs(title = "Predicted Probability of Real News vs Title Length",
       x = "Number of Words in Title",
       y = "Probability of REAL News") +
  theme_minimal()

What does this plot tell you about:
  1. The relationship between title length and the probability of real news? Longer titles are associated with a lower predicted probability of real news.
  2. How does this relationship differ for articles with/without exclamation points? This plot pools both groups; given the strongly negative title_has_excl coefficient, curves for exclamation-point articles would sit at much lower probabilities.
  3. Where is the model most/least certain? The draws spread more widely at extreme title lengths, where data are sparse.
  4. Does the visualization support your odds ratio interpretations? Yes: longer titles correspond to a lower probability of real news, consistent with an OR below 1.


🎓 Part B Reflection Questions

Answer These Questions
  1. Feature Importance: Which predictor(s) were most strongly associated with fake news? (Hint: look for OR farthest from 1.0) How can you tell from the odds ratios? title_has_excl was the strongest predictor: its 80% interval (OR 0.03-0.21) sits farthest from 1.0, while the intervals for title_words and negative lie much closer to 1.

  2. Classification Trade-offs: In the context of fake news detection, is it worse to:

    • Flag real news as fake (false negative - blocks legitimate information)?
    • Miss fake news and label it as real (false positive - spreads misinformation)?
    • How should this influence your cutoff choice? It’s worse to miss fake news because misinformation spreads, so I’d use a higher cutoff (0.7) to flag more articles as fake, accepting that more real news gets flagged for review.
  3. Model Limitations: What information is this model missing? What other predictors might be useful (e.g., source, author, date)? Missing: source credibility, author history, publication date, content analysis.

  4. Causality Warning: Can you say that exclamation points cause an article to be fake? Why or why not? What’s the difference between association and causation? No, this is correlation not causation. Fake news writers might use exclamation points more, but they don’t cause fakeness.

  5. Real-world Application: How would you deploy this model in practice? What additional validation would you need before using it to flag articles? I would deploy it only as a first-pass screen that flags suspicious articles for human review, never for automatic removal, given its low specificity (43%). Before deployment I would validate it on a larger, more recent held-out set of articles from diverse sources and monitor its error rates over time.


📝 Final Submission Checklist


📤 How to Submit

Step 1: Render to HTML

Click the blue “Render” button in RStudio and wait for completion.

Step 2: Open in Browser

Click the “Show in new window” icon in the Viewer pane.

Step 3: Save as PDF

Right click to print and then save as pdf. Upload the pdf to canvas